New job queue: worker registration and leader election #3307

sandhose · 2024-10-07T10:09:20Z

This adds the base for the new job queue system, with a simple worker registration system, as well as a leader election system.

The worker registration is meant to be used to detect lost workers and reschedule dead tasks they locked.
The leader election system is meant to have one leader performing all the maintenance work, like rescheduling tasks.

Part of #2785
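
As a rough illustration of the lost-worker detection described above, here is a minimal in-memory sketch. It is not the code in this PR: `WorkerRecord`, `last_seen_at` and `find_dead_workers` are hypothetical names, and a plain slice stands in for the database table.

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical in-memory stand-in for one row of the worker registration table.
#[derive(Debug, Clone, PartialEq)]
pub struct WorkerRecord {
    pub id: u64,
    /// Time of the last heartbeat recorded for this worker.
    pub last_seen_at: SystemTime,
}

/// Returns the IDs of workers whose last heartbeat is older than `threshold`.
/// Tasks locked by those workers would then be candidates for rescheduling.
pub fn find_dead_workers(
    workers: &[WorkerRecord],
    now: SystemTime,
    threshold: Duration,
) -> Vec<u64> {
    workers
        .iter()
        .filter(|w| match now.duration_since(w.last_seen_at) {
            // Heartbeat is in the past: dead if strictly older than the threshold.
            Ok(age) => age > threshold,
            // Heartbeat appears to be in the future (clock skew): treat as alive.
            Err(_) => false,
        })
        .map(|w| w.id)
        .collect()
}
```

In this sketch only the elected leader would call `find_dead_workers`, so that a dead worker's tasks are rescheduled once rather than by every node.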

cloudflare-workers-and-pages · 2024-10-07T10:09:22Z

Deploying matrix-authentication-service-docs with Cloudflare Pages

Latest commit:	`76afd6a`
Status:	✅ Deploy successful!
Preview URL:	https://410771fa.matrix-authentication-service-docs.pages.dev
Branch Preview URL:	https://quenting-new-queue-initial.matrix-authentication-service-docs.pages.dev

reivilibre

This seems reasonable, but it also seems quite intricate and would benefit from a careful review, including aspects like whether it's robust against clock drift, node failures, etc — that sort of thing.

I would also want to carefully review whether there are any problems if the current leader loses its connection but still believes it is the leader.
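
One possible mitigation for the stale-leader case, sketched under assumptions: the leader tracks its last successful renewal on the local monotonic clock and refuses to run maintenance once the lease, minus a safety margin, has run out. `LeaderLease` and its fields are hypothetical and not taken from this PR. The check narrows the window but does not close it, for example if the process is paused between the check and the work.

```rust
use std::time::{Duration, Instant};

/// Hypothetical local view of a leadership lease.
pub struct LeaderLease {
    /// When the lease was last successfully renewed, on the local monotonic clock.
    renewed_at: Instant,
    /// How long a renewal is valid for.
    ttl: Duration,
    /// Safety margin subtracted from the TTL before the leader stops acting.
    margin: Duration,
}

impl LeaderLease {
    pub fn new(renewed_at: Instant, ttl: Duration, margin: Duration) -> Self {
        Self { renewed_at, ttl, margin }
    }

    /// Record a successful renewal (e.g. after the database accepted it).
    pub fn renew(&mut self, now: Instant) {
        self.renewed_at = now;
    }

    /// True while the leader may still assume that it holds the lease.
    pub fn is_safe_to_act(&self, now: Instant) -> bool {
        let usable = self.ttl.saturating_sub(self.margin);
        now.saturating_duration_since(self.renewed_at) < usable
    }
}
```

A leader that cannot reach the database stops renewing, so `is_safe_to_act` turns false before another node's view of the lease expires, provided the margin covers the renewal round trip.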

reivilibre · 2024-10-09T17:05:39Z

crates/storage-pg/migrations/20241004075132_queue_worker.sql

+-- The leader is responsible for running maintenance tasks
+CREATE UNLOGGED TABLE queue_leader (
+  -- This makes the row unique
+  active BOOLEAN NOT NULL DEFAULT TRUE UNIQUE,


I know it sounds silly, and maybe dogmatic, but I'd make this a PRIMARY KEY. A handful of tools are not happy with tables that don't have a primary key (e.g. logical replication in Postgres by default), so I'd say it's worth always using it instead of UNIQUE etc.

reivilibre · 2024-10-09T17:13:09Z

crates/storage-pg/src/queue/worker.rs

+        // If no row was updated, the worker was shutdown so we return an error
+        DatabaseError::ensure_affected_rows(&res, 1)?;
+
+        Ok(worker)


the docstring says this returns the modified worker, but I don't see us modifying it.

I would expect the worker to track its own validity timestamps, but I guess the critical thing here is just that we 'take away' the Worker if we can't renew it?
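
To make the 'take away' idea concrete, here is a minimal sketch in which the heartbeat consumes the `Worker` and only hands it back when the renewal succeeds. `Registry`, `HeartbeatError` and the `alive` list are hypothetical stand-ins for the repository, its error type and the database table, not the types in this PR.

```rust
/// Hypothetical worker handle.
#[derive(Debug, PartialEq)]
pub struct Worker {
    pub id: u64,
}

#[derive(Debug, PartialEq)]
pub enum HeartbeatError {
    /// No row was updated: the worker was shut down or removed.
    WorkerGone,
}

/// Stand-in for the database: the IDs of workers that are still registered.
pub struct Registry {
    alive: Vec<u64>,
}

impl Registry {
    pub fn new(alive: Vec<u64>) -> Self {
        Self { alive }
    }

    /// Takes the worker by value. On failure the handle is dropped,
    /// so the caller cannot keep using a worker that is no longer valid.
    pub fn heartbeat(&self, worker: Worker) -> Result<Worker, HeartbeatError> {
        if self.alive.contains(&worker.id) {
            Ok(worker)
        } else {
            Err(HeartbeatError::WorkerGone)
        }
    }
}
```

Taking `&Worker` instead is simpler for callers, but the type system then no longer prevents a worker from being used after a failed heartbeat.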

reivilibre · 2024-10-09T17:15:29Z

crates/storage-pg/src/queue/worker.rs

+        clock: &dyn Clock,
+        threshold: Duration,
+    ) -> Result<(), Self::Error> {
+        let now = clock.now();


is it reasonable to rely on the system clock (which could drift between servers)?

Alternatively, I suppose we could use the Postgres database's clock. I don't know which one is best; I'm mostly just interested in considering it carefully.

sandhose · 2024-10-31

That's a fair point, although I'd imagine multiple servers in the same datacenter usually have the same time source/NTP server?

The clock is abstracted through a trait though, so maybe at some point we can take time drift into account and regularly sync the local system clock with the database or something, but I wouldn't worry too much about it for now.
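
A minimal sketch of how a clock trait could absorb drift, assuming a simplified `Clock` trait built on `std::time::SystemTime` (the real trait in the codebase may look different). `Offset`, `measure_offset`, `FixedClock` and `CorrectedClock` are hypothetical; the reference time would come from the database, which is not shown here.

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical minimal clock abstraction.
pub trait Clock {
    fn now(&self) -> SystemTime;
}

/// Reads the local system clock.
pub struct SystemClock;

impl Clock for SystemClock {
    fn now(&self) -> SystemTime {
        SystemTime::now()
    }
}

/// Always returns the same instant; useful in tests.
pub struct FixedClock(pub SystemTime);

impl Clock for FixedClock {
    fn now(&self) -> SystemTime {
        self.0
    }
}

/// How far the local clock is ahead of or behind a reference clock.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Offset {
    Ahead(Duration),
    Behind(Duration),
}

/// Compares a local reading with a reference reading taken at the same moment.
pub fn measure_offset(local: SystemTime, reference: SystemTime) -> Offset {
    match local.duration_since(reference) {
        Ok(d) => Offset::Ahead(d),
        Err(e) => Offset::Behind(e.duration()),
    }
}

/// Wraps another clock and corrects it by a previously measured offset.
pub struct CorrectedClock<C> {
    pub inner: C,
    pub offset: Offset,
}

impl<C: Clock> Clock for CorrectedClock<C> {
    fn now(&self) -> SystemTime {
        let local = self.inner.now();
        match self.offset {
            Offset::Ahead(d) => local - d,
            Offset::Behind(d) => local + d,
        }
    }
}
```

Code that takes `&dyn Clock` would not need to change; only the clock passed in would differ. The sketch ignores the network round trip when reading the reference time, which adds its own error to the measured offset.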

sandhose added the A-Jobs label (Related to asynchronous jobs) Oct 7, 2024

sandhose requested a review from reivilibre October 7, 2024 10:10

sandhose force-pushed the quenting/new-queue/initial branch from 48e5507 to e419853 Compare October 9, 2024 08:29

reivilibre reviewed Oct 9, 2024

sandhose force-pushed the quenting/new-queue/initial branch from e419853 to 1370a04 Compare October 10, 2024 08:55

sandhose mentioned this pull request Oct 15, 2024

Rewrite the async job system #2785

Open

sandhose force-pushed the quenting/new-queue/initial branch from 2f19fff to 80aa6fa Compare October 15, 2024 12:48

sandhose force-pushed the quenting/new-queue/initial branch from 80aa6fa to f060abe Compare October 30, 2024 14:28

sandhose mentioned this pull request Oct 31, 2024

Insert jobs using the new queue #3367

Open

sandhose added 5 commits October 31, 2024 18:14

New job queue: worker registration and leader election

c92692b

Make the worker heartbeat take a worker reference

823174c

Move the worker logic in a struct

2801285

TEMP: use patched sqlx

f4897ca

We would like to use the underlying connection from the PgListener, which was added in a patch, but not yet merged or released.

Graceful shutdown

76afd6a

sandhose force-pushed the quenting/new-queue/initial branch from f060abe to 76afd6a Compare October 31, 2024 17:14
